132 research outputs found

    Shape-IT: new rapid and accurate algorithm for haplotype inference

    BACKGROUND: We have developed a new computational algorithm, Shape-IT, to infer haplotypes under the genetic model of coalescence with recombination developed by Stephens et al. in Phase v2.1. It runs much faster than Phase v2.1 while achieving the same accuracy. The main algorithmic improvements rely on binary trees that represent the set of candidate haplotypes for each individual. These binary tree representations (1) speed up the computation of the posterior probabilities of the haplotypes by avoiding the redundant operations performed in Phase v2.1, and (2) overcome the exponential nature of the haplotype inference problem by exploring only the most plausible pathways (i.e., haplotypes) in the binary trees. RESULTS: Our results show that Shape-IT is several orders of magnitude faster than Phase v2.1 while being as accurate. For instance, Shape-IT runs about 50 times faster than Phase v2.1 when computing the haplotypes of 200 subjects on 6,000 segments of 50 SNPs extracted from a standard Illumina 300 K chip (13 days instead of 630 days). We also compared Shape-IT with other widely used software (Gerbil, PL-EM, Fastphase, 2SNP, and Ishape) in various tests: Shape-IT and Phase v2.1 were the most accurate in all cases, followed by Ishape and Fastphase. In terms of speed, Shape-IT was faster than Ishape and Fastphase for datasets smaller than 100 SNPs, but Fastphase became faster, though still less accurate, on larger SNP datasets. CONCLUSION: Shape-IT deserves to be used extensively for regular haplotype inference, and also with the new high-throughput genotyping chips, since it makes it possible to fit the genetic model of Phase v2.1 on large datasets. This new algorithm based on tree representations could be used in other HMM-based haplotype inference software and may apply more broadly to other fields that use HMMs.
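
The abstract does not give Shape-IT's implementation, so the following is only a toy sketch of the general idea: candidate haplotypes form a binary tree that branches at heterozygous sites, shared prefixes are scored once, and only the most plausible partial paths are kept. The function names, the 0/1/2 genotype coding, the beam search, and the switch-penalty score are all assumptions made for illustration; they stand in for the coalescent-based probabilities used by the real software.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Haplotype = Tuple[int, ...]


def candidate_alleles(genotype: int) -> Tuple[int, ...]:
    """Alleles one haplotype can carry at a site whose genotype is 0, 1 or 2
    (count of the alternate allele). Only heterozygous sites (1) branch."""
    return {0: (0,), 1: (0, 1), 2: (1,)}[genotype]


def explore_tree(
    genotypes: Sequence[int],
    step_score: Callable[[Haplotype, int], float],
    beam_width: int,
) -> List[Tuple[float, Haplotype]]:
    """Descend the binary tree of candidate haplotypes level by level,
    keeping only the beam_width best-scoring partial paths at each site."""
    beam: List[Tuple[float, Haplotype]] = [(0.0, ())]
    for genotype in genotypes:
        children = [
            # The prefix score is reused for every child of the same node.
            (score + step_score(prefix, allele), prefix + (allele,))
            for score, prefix in beam
            for allele in candidate_alleles(genotype)
        ]
        beam = heapq.nlargest(beam_width, children)  # best path first
    return beam


def complement(genotypes: Sequence[int], haplotype: Haplotype) -> Haplotype:
    """The second haplotype of the pair is fixed by the genotype."""
    return tuple(g - a for g, a in zip(genotypes, haplotype))


def toy_step_score(prefix: Haplotype, allele: int) -> float:
    """Hypothetical log-score: penalise switching allele between sites."""
    return 0.0 if not prefix or prefix[-1] == allele else -1.0
```

With three heterozygous sites the full tree has 2^3 = 8 leaves; a beam width of 8 enumerates all of them, whereas a width of 2 visits only a fraction of the tree, which is the kind of saving the abstract attributes to exploring only the most plausible pathways.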

    Approches bioinformatiques pour l'exploitation des données génomiques [Bioinformatics approaches for exploiting genomic data]

    Current technologies make it possible to explore the whole genome to identify genetic variants associated with particular phenotypes, notably diseases, and bioinformatics provides the tools to address this problem. In this thesis, I first developed a new software tool that measures with good precision the number of effectively independent genetic markers in a set of markers genotyped in a given population. The algorithm relies on the Shannon entropy of these markers and on the mutual information computed for pairs of SNPs chosen within a window of consecutive SNPs, whose size is a parameter of the program. I showed that the number of independent markers becomes stable once the population is homogeneous and large enough (N > 60 individuals) and the window is large enough (size > 100 SNPs). This computation has several possible applications, in particular relaxing the Bonferroni threshold by a factor that can sometimes reach 4, although this has little impact in practice. I also carried out a genome-wide association study of photo-ageing in 502 Caucasian women whose photo-ageing grade was assessed with a well-established technology. The women were genotyped on Illumina OmniOne chips (1M SNPs), and I identified two genes (STXBP5L and FBXO40) associated with a SNP that passes the Bonferroni threshold; their involvement in photo-ageing was previously unknown. This association was also found for two other phenotypes, which suggests a possible common molecular mechanism between skin sagging and wrinkling. No replication was observed for lentigines, the third component of photo-ageing studied. This work is being prepared for publication in international peer-reviewed scientific journals.
    PARIS-CNAM (751032301) / Sudoc, France
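
The abstract names the quantities the program relies on but not how it combines them, so the sketch below only illustrates those building blocks: the Shannon entropy of a SNP, the mutual information between two SNPs, the pairs considered inside a sliding window, and a plain Bonferroni threshold. All function names, the 0/1/2 genotype coding, and the default significance level of 0.05 are assumptions; the estimation of the number of independent markers itself is not reproduced here.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Sequence, Tuple


def shannon_entropy(values: Sequence[Hashable]) -> float:
    """Entropy in bits of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x: Sequence[int], y: Sequence[int]) -> float:
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired genotypes."""
    joint = list(zip(x, y))
    return shannon_entropy(x) + shannon_entropy(y) - shannon_entropy(joint)


def windowed_pairwise_mi(
    snps: Sequence[Sequence[int]], window: int
) -> Dict[Tuple[int, int], float]:
    """Mutual information for every pair of SNPs that fits inside a window
    of `window` consecutive SNPs (the window size is a free parameter)."""
    return {
        (i, j): mutual_information(snps[i], snps[j])
        for i in range(len(snps))
        for j in range(i + 1, min(i + window, len(snps)))
    }


def bonferroni_threshold(alpha: float, n_tests: float) -> float:
    """Per-test significance threshold for n_tests (effective) tests."""
    return alpha / n_tests
```

As a usage note, with roughly one million SNPs and an assumed alpha of 0.05 the threshold is about 5e-8; replacing the raw SNP count by an effective number of independent markers four times smaller would make it four times larger.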

    Evaluation et application de méthodes de criblage in silico [Evaluation and application of in silico screening methods]

    Since virtual (in silico) screening was introduced into the drug discovery process, the number of available methods has grown steadily, and these methods need to be evaluated. In this work, eight virtual screening methods were evaluated on the DUD reference database and showed adequate efficiency. The evaluation also revealed shortcomings in the construction of the DUD: the binding site conformation it uses is not suited to all the actives. Because current computational power allows it, classical docking runs were performed on several experimental X-ray structures in order to mimic binding site flexibility. Another problem was also identified: the metrics used to evaluate the methods are biased. New metrics such as BEDROC and RIE have therefore been proposed. A further alternative is proposed here, based on predictiveness curves, which measures a method's predictive ability from the probability that a compound is active. Finally, a virtual screening protocol was applied to TNFα and identified a small-molecule inhibitor that is active in vitro and in vivo in mice. Thus, although the methods still need improvement, virtual screening can provide valuable support for identifying new molecules with therapeutic potential.
    PARIS-CNAM (751032301) / Sudoc, France
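
For readers unfamiliar with the two metrics named above, here is a minimal sketch of how RIE and BEDROC are commonly computed from a ranked list of compounds. It follows the usual definitions (RIE as an exponentially weighted sum over the ranks of the actives, BEDROC as RIE rescaled between its minimum and maximum); the function names and the default alpha of 20 are assumptions, and the thesis's own predictiveness-curve method is not reproduced.

```python
import math
from typing import Sequence


def rie(is_active: Sequence[bool], alpha: float = 20.0) -> float:
    """Robust initial enhancement for a list ordered from best to worst
    score; is_active[i] says whether the compound at rank i + 1 is active."""
    n_total = len(is_active)
    n_actives = sum(is_active)
    if n_actives == 0:
        raise ValueError("RIE is undefined without any active compound")
    # Exponentially weighted sum over the ranks of the actives.
    weighted = sum(
        math.exp(-alpha * rank / n_total)
        for rank, active in enumerate(is_active, start=1)
        if active
    )
    # Expected value of the same sum if actives were ranked at random.
    expected = (
        (n_actives / n_total)
        * (1 - math.exp(-alpha))
        / (math.exp(alpha / n_total) - 1)
    )
    return weighted / expected


def bedroc(is_active: Sequence[bool], alpha: float = 20.0) -> float:
    """BEDROC: RIE rescaled so the worst ranking gives 0 and the best 1."""
    ratio = sum(is_active) / len(is_active)
    if ratio in (0.0, 1.0):
        raise ValueError("BEDROC needs both actives and inactives")
    rie_max = (1 - math.exp(-alpha * ratio)) / (ratio * (1 - math.exp(-alpha)))
    rie_min = (1 - math.exp(alpha * ratio)) / (ratio * (1 - math.exp(alpha)))
    return (rie(is_active, alpha) - rie_min) / (rie_max - rie_min)
```

Larger values of alpha concentrate the weight on the very top of the ranked list, which is why these metrics are described as addressing the early-recognition problem.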

    Computation of haplotypes on SNPs subsets: advantage of the "global method"

    BACKGROUND: Genetic association studies aim to find correlations between a disease state and genetic variations such as SNPs or combinations of SNPs, termed haplotypes. Some haplotypes have a particular biological meaning, such as those derived from SNPs located in promoters or from non-synonymous SNPs. All of these are "subhaplotypes" because they cover only part of the SNPs found in the gene. Until now, subhaplotypes have been computed directly from the SNPs chosen to constitute them, without taking into account the information carried by the other SNPs in the gene. In the present work, we describe an alternative approach, called the "global method", which takes into account all the SNPs known in the region, and we compare the efficacy of the "direct" and "global" methods. RESULTS: We used empirical haplotype data sets from the GH1 promoter and the APOE gene, together with 10 simulated data sets, and randomly introduced missing information into them (from 0% up to 20%) to compare the two methods. For each method, we used the PHASE haplotyping software, since it has been described as the best. We showed that the "global method" for subhaplotyping always yields a lower error rate than classical direct haplotyping. The advantage of this alternative method increases with the percentage of missing genotyping data (the average error rate falls from 25% to less than 10%). We applied the global method software to the GRIV cohort for AIDS genetic associations, and some associations previously identified through direct subhaplotyping were found to be erroneous. CONCLUSION: The global method for subhaplotyping can reduce, sometimes dramatically, the error rate on patient resolutions and haplotype frequencies. It should therefore be used to minimise the risk of false interpretation in genetic studies involving subhaplotypes. In practice the global method is always more efficient than the direct method, but a combination method that takes into account the level of missing information in each subject appears even more advantageous when the level of missing information exceeds 10%.
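
The difference between the two methods can be sketched as follows. In the global method, haplotypes are first inferred on all SNPs of the region and only then restricted to the SNPs of interest; the direct method would instead run the inference on the subset alone. The phasing step itself (done with PHASE in the study) is not reproduced here: the example assumes already phased haplotype pairs, and the function names and string encoding are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def subhaplotype(haplotype: str, subset: Sequence[int]) -> str:
    """Restrict a full haplotype to the SNP positions listed in `subset`."""
    return "".join(haplotype[i] for i in subset)


def subhaplotype_frequencies(
    phased_pairs: Iterable[Tuple[str, str]], subset: Sequence[int]
) -> Dict[str, float]:
    """Global-method projection: take haplotype pairs phased on all SNPs
    and count the subhaplotypes they induce on the chosen positions."""
    counts = Counter(
        subhaplotype(hap, subset) for pair in phased_pairs for hap in pair
    )
    total = sum(counts.values())
    return {hap: count / total for hap, count in counts.items()}
```

Because the projection happens after phasing, the SNPs outside the subset still contribute information to the inference, which is the advantage the abstract reports when genotypes are missing.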

    Evidence After Imputation for a Role of MICA Variants in Nonprogression and Elite Control of HIV Type 1 Infection

    Past genome-wide association studies (GWAS) involving individuals with AIDS have mainly identified associations in the HLA region. Using the latest software, we imputed 7 million single-nucleotide polymorphisms (SNPs)/indels of the 1000 Genomes Project from the GWAS-determined genotypes of individuals in the Genomics of Resistance to Immunodeficiency Virus AIDS nonprogression cohort and compared them with those of control cohorts. The strongest signals were in MICA, the gene encoding major histocompatibility class I polypeptide-related sequence A (P = 3.31 × 10^-12), with a particular exonic deletion (P = 1.59 × 10^-8) in full linkage disequilibrium with the reference HCP5 rs2395029 SNP. Haplotype analysis also revealed an additive effect between HLA-C, HLA-B, and MICA variants. These data suggest a role for MICA in progression and elite control of human immunodeficiency virus type 1 infection.

    Gene expression profiling reveals a conserved microglia signature in larval zebrafish

    Microglia are the resident macrophages of the brain. Over the past decade, our understanding of the function of these cells has significantly improved. Microglia not only play important roles in the healthy brain but are also involved in almost every brain pathology. Gene expression profiling has made it possible to distinguish microglia from other macrophages and revealed that the full microglia signature can only be observed in vivo. Animal models are therefore irreplaceable for understanding the function of these cells. One popular model for studying microglia is the zebrafish larva. Owing to their optical transparency and genetic accessibility, zebrafish larvae have been used to understand a variety of microglia functions in the living brain. Here, we performed RNA sequencing of larval zebrafish microglia at different developmental time points: 3, 5, and 7 days post fertilization (dpf). Our analysis reveals that larval zebrafish microglia rapidly acquire the core microglia signature and that many typical microglia genes are expressed from 3 dpf onwards. Most changes in gene expression occurred between 3 and 5 dpf, suggesting that differentiation mainly takes place during these days. Furthermore, we compared the larval microglia transcriptome to published data sets of adult zebrafish microglia, mouse microglia, and human microglia. Larval microglia shared a significant number of expressed genes with their adult counterparts in zebrafish as well as with mouse and human microglia. In conclusion, our results show that larval zebrafish microglia mature rapidly and express the core microglia gene signature, which seems to be conserved across species. KEYWORDS: brain, evolution, microglia, RNA sequencing, transcriptome, zebrafish